INTERNATIONAL APPLICATION PUBLISHED UNDER THE PATENT COOPERATION TREATY (PCT)

(11) International Publication Number: WO 94/10638

(43) International Publication Date: 11 May 1994 (11.05.94)

(21) International Application Number: PCT/AU93/00573

(22) International Filing Date: 5 November 1993 (05.11.93)

5 November 1992 (05.11.92) AU



(71) Applicant (for all designated States except US): THE COMMONWEALTH
OF AUSTRALIA [AU/AU]; C/- The Secretary, Department of Defence, Anzac
Park West Building, Constitution Avenue, Canberra, ACT 2601 (AU).

(72) Inventors; and

(75) Inventors/Applicants (for US only): MARWOOD, Warren [AU/AU];
6 Selangor Avenue, Fairview Park, S.A. 5126 (AU). CLARKE, Allen, Patrick
[AU/AU]; 22 Sherbourne Road, Medindie Gardens, S.A. 5081 (AU).
CLARKE, Robert, John [AU/AU]; 20 Albert Street, Dulwich, S.A. 5065 (AU).



5, S.A. 5000 (AU).



LU, LV, MG, MN, MW, NL, NO, NZ, PL, PT, RO,
RU, SD, SE, SK, UA, US, UZ, VN, European patent
(AT, BE, CH, DE, DK, ES, FR, GB, GR, IE, IT, LU,
MC, NL, PT, SE), OAPI patent (BF, BJ, CF, CG, CI,
CM, GA, GN, ML, MR, NE, SN, TD, TG).

With international search report.



(54) Title: SCALABLE DIMENSIONLESS ARRAY 



[Figure: block diagram of a processing element. Labels: Input Registers, Register File, X bus (operands), Y bus (operands), R bus (results), Arithmetic Unit, Output Registers; outputs labelled Yout, YWSout, YHSout.]



A processing element for use in a scalable array processor chip which can perform a number of floating point matrix operations for
conformable matrices of arbitrary order on an array of fixed size. The processing element includes a number of input and output
registers, storage registers, a shifter/normaliser, an arithmetic unit (datapath elements) and a control sequencing unit. The
datapath elements are connected by a number of parallel data buses, with the input and output registers connected by serial interfaces.




SCALABLE DIMENSIONLESS ARRAY

TECHNICAL FIELD

This invention relates to the general field of digital computing and in particular
to a scalable array of globally clocked multiply/accumulate floating point
processing elements.

BACKGROUND ART 

Kung and Leiserson ['Systolic Arrays (for VLSI)' in Sparse Matrix Proceedings
1978, Soc. for Industrial and Applied Mathematics, 1979] presented the concept
of performing matrix operations using arrays of simple processing elements.
Each processing element implements a simple primitive operation. As an
example, at a given time, a processor may:

read the input data vector {a(in),b(in),c(in)},

perform an arithmetic operation such as c(out) = a(in)b(in) + c(in),

write the output data vector {a(out),b(out),c(out)}.

The processing elements are connected only to their nearest neighbours, and
so the problems of routing, fan-out and clock skew are minimised. Data and
results move synchronously through the array of elements. The name applied
to this approach to computation with arrays of identical processing elements is
systolic.
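The read, compute and write behaviour of such a cell can be sketched in a few lines of Python. This is an illustrative model only: the function names are hypothetical, and passing a(in) and b(in) through unchanged to a(out) and b(out) follows the usual description of the inner-product-step cell rather than anything stated in this text.

```python
def inner_product_step(a_in, b_in, c_in):
    """One beat of an inner-product-step cell: read, compute, write."""
    c_out = a_in * b_in + c_in      # c(out) = a(in) b(in) + c(in)
    a_out, b_out = a_in, b_in       # operands are forwarded to the neighbours (assumed)
    return a_out, b_out, c_out


def chain_inner_product(a_values, b_values):
    """Feed the partial result c through a chain of cells, one operand pair per cell."""
    c = 0.0
    for a, b in zip(a_values, b_values):
        _, _, c = inner_product_step(a, b, c)
    return c
```

Chaining the cells so that each one receives the previous cell's c(out) accumulates an inner product, which is the building block of the matrix product.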

An example algorithm quoted by Kung and Leiserson was the matrix product.
Using a systolic array in which the processing elements executed the local
algorithm presented above, known as an inner-product-step algorithm, they
showed that the system level algorithm which this implemented was a matrix
product of computational order O(N), rather than the computational order O(N^3)
for the matrix product implemented on a conventional scalar architecture. The
matrix product is represented simply as

C = AB

where A, B and C are matrices of a size equal to the order of the array.
Conformable matrices can be multiplied with this array under certain restrictions if
the results are re-circulated through the array, and larger order matrix products
can be computed if the task is partitioned. Speiser, Whitehouse and Bromley
['Signal Processing Applications for Systolic Arrays', Record of the 14th
Asilomar Conference on Circuits, Systems and Computers, IEEE No.
80CH1625-3, 1980] subsequently demonstrated the use of inner-product-
accumulate processing elements for the same matrix product algorithm. In this
case, the results are formed in-place, and do not move between processing
elements. The only difference between the description of this algorithm and the
algorithm described above is that the input and output phases of the algorithm
do not include the reading and writing of c(in) and c(out) respectively, and that
an explicit unload phase must be added at the end of the algorithm to return the
results.
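A word-level simulation may help to show how in-place accumulation yields C = AB. The sketch below is not taken from the references: it assumes a square N x N array, operands that move one cell per beat (A from the left, B from the top), and a one-beat skew per row and per column at the boundary. The name systolic_matmul is hypothetical.

```python
def systolic_matmul(A, B):
    """Inner-product-accumulate on an n x n array; results stay in place."""
    n = len(A)
    acc = [[0.0] * n for _ in range(n)]      # one accumulator per cell
    a_reg = [[0.0] * n for _ in range(n)]    # operands moving left to right
    b_reg = [[0.0] * n for _ in range(n)]    # operands moving top to bottom
    for t in range(3 * n - 2):               # beats until the last pair meets
        for i in range(n):                   # shift A operands one cell right
            for j in range(n - 1, 0, -1):
                a_reg[i][j] = a_reg[i][j - 1]
        for j in range(n):                   # shift B operands one cell down
            for i in range(n - 1, 0, -1):
                b_reg[i][j] = b_reg[i - 1][j]
        for i in range(n):                   # row i of A enters i beats late
            k = t - i
            a_reg[i][0] = A[i][k] if 0 <= k < n else 0.0
        for j in range(n):                   # column j of B enters j beats late
            k = t - j
            b_reg[0][j] = B[k][j] if 0 <= k < n else 0.0
        for i in range(n):                   # every cell multiplies and accumulates
            for j in range(n):
                acc[i][j] += a_reg[i][j] * b_reg[i][j]
    return acc                               # stands in for the explicit unload phase
```

With this skew, cell (i, j) sees A[i][k] and B[k][j] on the same beat for each k, so its accumulator ends up holding element (i, j) of the product.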



The primary advantage of systolic processing over conventional linear
processing is speed. The systolic architecture uses the fact that for matrix
multiplication, the same operand data may be reused many times in the
computation of cross-product terms, thereby making better use of the available
data bandwidth. The improved performance, however, comes at the cost of
flexibility. Prior art devices have been designed for very specific applications
such as Fast Fourier Transform computations or video signal processing. An
advantage of the present invention is the ability of the same device to be useful
for a wide variety of matrix computations without the need for hardware
reconfiguration. The device is particularly useful when implemented as an
architectural enhancement to a computer, in which case the processing power
of the computer is considerably enhanced.



DISCLOSURE OF THE INVENTION 



It is an object of this invention to provide a processing element for use in a
scalable array processor which is able to implement a set of primitive floating
point matrix operations for conformable matrices of arbitrary order on an array
of fixed size.

It is a further object to provide a scalable array processor chip which is able to
perform one or more of the following functions:

compute the product of two matrices
compute the element-wise (Hadamard) product of two matrices
compute the sum of two matrices
permute the rows and columns of a matrix
transpose a matrix.

It is a still further object of this invention to at least provide the public with a
useful alternative to existing systolic devices.

Therefore, according to perhaps one form of this invention, although this need
not be the only or indeed the broadest form, there is proposed a processing
element suitable for use in a scalable array processor comprising:
at least one input register means adapted to receive and process serial
operands in the form of {instruction, data} 2-tuples;

a memory means adapted to store temporary results and constants;

a computing means adapted to perform logical operations;

an output register means adapted to output results from the processing
element;

a control and sequencing means adapted to control the operation of the
processing element;

a plurality of data buses adapted to provide communication between the
plurality of means.

In preference the computing logical means consists of a shifter/normaliser
means adapted to shift/normalise data and an arithmetic means adapted to
perform logical operations such as but not limited to addition, subtraction and
partial multiplication operations.

In preference the processing element is adapted to perform floating point
multiply, floating point add and floating point multiply-accumulate which is used
for inner product accumulate operations.

In preference the input register is adapted to output a copy of the input operand
bit with a one clock period delay.

In preference there are N input registers and the processing element is suitable
for use in an N-dimensional scalable array processor.

In preference N can be any positive integer.

In preference the input registers convert the input serial data to an internal
representation comprising separate sign, fraction and exponent.

In preference the memory means consists of read only memory for storage of
constants and a read/write memory for storage of temporary results.

In preference the shifter/normalizer means is adapted to perform binary
weighted barrel shifting wherein the shifter function is determined by a control
input to the shifter/normalizer and the normalizer function effects a data
dependent shift of up to 15 bits within a single clock cycle.

In preference the arithmetic means implements logical operations such as but
not limited to floating point addition, multiplication and multiply-accumulate
algorithms using a parallel microcoded data path.

In preference the arithmetic means comprises a logical unit such as but not
limited to an input multiplexer, an adder, an output shifter, flags unit and a
control unit.

In preference the output register can be loaded in parts to enable the
conversion from the internal representation to IEEE 754 floating point format.
The output register can be parallel loaded from the arithmetic means or can be
serially loaded from a serial source. The register is unloaded serially.

In preference the control and sequencing means includes timing and control
logic, a microcode ROM, address decoders, branch control logic, flags logic,
instruction register, instruction decoder and a program counter.

In preference there are three data buses, an X bus, a Y bus and an R bus. The X
and Y buses are called operand buses and the R bus is called the result bus.

In preference the processing element has an accumulator comparison means.

In another form the invention consists of a scalable array processor chip
comprising an array of processing elements, each said element including:
at least one input register means adapted to receive and process serial
operands in the form of {instruction, data} 2-tuples;
a memory means adapted to store temporary results and constants;
a shifter/normalizer means adapted to shift or normalize data;
an arithmetic means adapted to perform logical operations such as but not
limited to addition, subtraction and partial multiplication operations;
an output register means adapted to output results from the processing
element;

a control and sequencing means adapted to control the operation of the
processing element;

a plurality of data buses adapted to provide communication between the
plurality of means; and
wherein each processing element has means for communication only with
adjacent elements.

In preference the array of elements comprises an interconnected lattice of at
least one dimension.

In preference the array of elements comprises an interconnected lattice of at
least two dimensions.

In preference the scalable array processor chip is adapted to perform at least
the functions of computing the product of two or more matrices, computing the
element-wise product of two or more matrices, computing the sum of two or
more matrices, permuting the rows and columns of a matrix and transposing a
matrix.

In a yet further form of the invention there is proposed a computing apparatus
comprising a host processor, at least one scalable array processor chip and a
plurality of data formatters wherein the scalable array processor chip(s) and
plurality of data formatters are adapted to perform matrix operations otherwise
performed by the host processor.

In preference the apparatus includes a memory cache adapted to store
operand data and temporary or intermediate results.

In a still further form of the invention there is proposed a method of performing
matrix operations comprising the steps of:
(a) providing a plurality of processing elements in the form of an array
adapted to perform systolic processing operations;

(b) receiving operand matrix data for processing from a host or data source;

(c) formatting the operand matrix data in a data formatter by adding an
instruction to form an {instruction, data} 2-tuple;

(d) transferring sets of 2-tuples to the processing element array to cause the
processing elements to process the data in accordance with the instruction;

(e) repeating the steps (b) to (d) a number of times, said number of times
being dictated by the matrix operation being performed;

(f) unloading the results of the matrix operations into an output result
register of the processing elements (under control of an instruction specified by
the operand 2-tuple);

(g) transferring the contents of the output result registers held within the
plurality of processing elements back to data formatters as result wavefronts;

(h) storing the result wavefront data back to a host or data sink; and

(i) repeating the steps (f) to (h) a number of times, said number of times
being dictated by the matrix operation being performed.

In preference, the data formatter is of the type the subject of co-pending patent
application number PL5696 entitled "DATA FORMATTER".

The sets of 2-tuples are known as operand wavefronts. During the unload step,
an indication is provided to the data formatter that the unload operation is
occurring, allowing synchronization of data transfers to and from the processing
array.

During step (g), the results are transmitted in sets containing one result from
each of the processing elements at the left edge of the array. Such a set is
known as a result wavefront.

There may be as many result wavefronts held within the array as there are
columns of processing elements.

BRIEF DESCRIPTION OF THE DRAWINGS

For a better understanding of this invention a preferred embodiment will now be
described with reference to the attached drawings in which:

FIG. 1 is a schematic diagram of one processing element;

FIG. 2 is a schematic diagram of one embodiment of a systolic array
processing element chip utilising the elements of FIG. 1;

FIG. 3 is a schematic diagram of a first embodiment of a processing
apparatus utilizing the chip of FIG. 2;

FIG. 4 is a schematic diagram of an inner-product-step processor;

FIG. 5 is a schematic diagram of an inner-product-accumulate processor;

FIG. 6 is a schematic example of the entry of operand wavefronts to a
processor array;

FIG. 7 is a schematic example of the unloading of result wavefronts from
a processor array;

FIG. 8 is a schematic example of the entry of element-wise operand
wavefronts to a processor array;

FIG. 9 is a schematic example of the unloading of element-wise result
wavefronts from a processor array; and

FIG. 10 is a schematic diagram of a second embodiment of a processing
apparatus.

BEST MODE FOR CARRYING OUT THE INVENTION

Referring now to the drawings in detail, each processing element consists of a
number of input registers, a memory consisting of a register file and a constant
ROM, a shifter/normalizer, an arithmetic unit, output registers and a control and
sequencing unit.

The datapath elements (input registers, memory, shifter/normalizer, arithmetic
unit and output registers) are interconnected by three parallel data buses. In
addition serial interfaces are provided to and from each of the input registers
and the output register to allow communication between processing elements
and to facilitate construction of arbitrarily large arrays of processing elements.

An N x N array computes 2N^2 floating point operations (1 multiply and 1 accumulate
for each processing element in the array) in the time taken to fetch 2N
operands. To ensure that the computation is bandwidth limited, each
processing element needs only compute at a rate of one floating point operation
every N data fetches. This fact leads to the conclusion that very cheap
processing elements can be used in the array.
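The arithmetic behind this conclusion can be checked with a short calculation. The sketch assumes an N x N array, which is the reading that makes the quoted operation and fetch counts consistent; the function name is hypothetical.

```python
def flops_per_fetch(n):
    """Floating point operations each processing element must finish per operand fetch."""
    array_flops = 2 * n * n          # 1 multiply + 1 accumulate in each of n*n elements
    operand_fetches = 2 * n          # operands fetched over the same interval
    flops_per_element = array_flops / (n * n)
    return flops_per_element / operand_fetches


for n in (4, 16, 64):
    print(n, flops_per_fetch(n))     # the required rate falls as 1/n
```

The larger the array, the slower each processing element is allowed to be before it, rather than the operand bandwidth, limits throughput.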

A schematic of the processing element is shown in FIG. 1. The choice of a
single microcoded datapath and sequential algorithms to perform the floating
point operations means that the size of the processing element can be kept
small. Many such processing elements can therefore be placed on a single
chip. The fact that processing elements implemented in this manner are slower
than those built using fully parallel algorithms and architectures becomes
insignificant as the size of the array is increased. This is because the
processing performance achieved is limited by the data bandwidth (and array
size), not by the computation rate for a single processing element.

The functions performed by each module in the processing element are
described below.

Input Registers: The input registers receive serial operands in the form of
{instruction, data} 2-tuples from adjacent processing elements to the left or top,
or in the case of processing elements at the top or left boundary of the array,
from operand data formatters. They then separate the instruction and reformat
the data to an internal representation consisting of separate sign, exponent and
fraction words. This data is available to the processing element via the X and Y
internal data buses. The input registers also compute the sign of the product of
the two inputs, check for zero operand data and implement the Booth encoder
used during multiplication operations.

Memory: The memory consists of a Register File and a Constant ROM. The
register file is a 5 word memory used to hold the product (both fraction and
exponent), accumulator (both fraction and exponent) and temporary results.
The product and accumulator registers can be swapped under the control of
microcode to facilitate efficient implementation of the pre-alignment operation in
the floating point addition and accumulation algorithms. The registers can be
loaded from the R bus, and their contents can be read from either the X or Y
buses. The Constant ROM stores a number of constants that are used during
the implementation of the floating point algorithms. These can be read via the X
and Y operand data buses.

Shifter/Normalizer: Under microcode control, the shifter can either operate as a
shifter (for pre-alignment of fractions before addition) or a normalizer. When
acting as a shifter, it performs right-shift operations on one operand datum (the
X operand). The amount by which the datum is shifted is determined by a
previously computed shift that is applied to the second input to the shifter (the Y
operand). The shifter can shift 0 to 15 bits right within one clock cycle. When
acting as a normalizer, the Shifter/Normalizer performs either a right shift by
one bit, or a left shift by 0 to 15 bits within one cycle. In this case the shift is
applied to the X operand input to the shifter and is independent of the Y
operand input. The value of the shift is data dependent. A right-shift is
performed if the value on the X input is the result of a computation which had
overflowed (such as in the case of addition of two normalized numbers having
the same exponent). Otherwise, a left-shift is performed. When acting as a
normalizer, the Shifter/Normalizer at the same time computes the offset
(exponent offset) that must be applied to the exponent of the number being
normalized in order to compensate for the shift that is applied. Shifting and
normalization operations that require shifts of greater than 15 bits can be
implemented by multiple passes through the shifter/normalizer.
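The multi-pass behaviour can be illustrated with a small model. This is a sketch under stated assumptions, not the microcode: it handles non-negative magnitudes only (the hardware works on 2's complement words), it takes bit 29 as the normalization target in line with the internal format described later, and it reports the exponent offset as negative for left shifts. All names are hypothetical.

```python
MAX_SHIFT = 15      # largest shift available in a single pass
NORM_BIT = 29       # assumed position of the leading one after normalization


def shift_right_multipass(x, amount):
    """Right-shift x by amount using passes of at most MAX_SHIFT bits."""
    passes = 0
    while amount > 0:
        step = min(amount, MAX_SHIFT)
        x >>= step
        amount -= step
        passes += 1
    return x, passes


def normalize(x):
    """Return (normalized x, exponent offset, passes) for 0 <= x < 2**31."""
    assert 0 <= x < (1 << 31)
    if x == 0:
        return 0, 0, 0
    if x >> (NORM_BIT + 1):             # overflow into bit 30: one right shift
        return x >> 1, 1, 1
    offset = 0
    passes = 0
    while not (x >> NORM_BIT) & 1:      # left shifts of up to 15 bits per pass
        lead = x.bit_length() - 1
        step = min(NORM_BIT - lead, MAX_SHIFT)
        x <<= step
        offset -= step                  # the exponent must drop by the shift applied
        passes += 1
    return x, offset, passes
```

For example, a pre-alignment shift of 37 bits takes three passes (15 + 15 + 7), and a value whose leading one sits at bit 0 needs two normalization passes (15 + 14).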

Arithmetic Unit: The arithmetic unit consists of an input multiplexer, an adder, a
result shifter and a flags unit. There are two parallel data inputs (X and Y) to the
arithmetic unit and a single parallel data output (R). The input multiplexer can
be used to complement and/or left-shift the X operand under control of the
Booth encoding logic contained in the input registers. This feature is used in the
implementation of multiplication using a modified Booth algorithm. The
multiplexer can also be controlled directly by the processing element's
microcode to facilitate the implementation of addition, subtraction and data-
move operations.

The adder performs conventional two's complement addition. The carry input to
the adder can be controlled by either the Booth encoder logic or the processing
element's microcode. Both addition and subtraction can be performed by the
combination of input-multiplexer and adder. Under control of the processing
element's microcode, the output shifter latches either the result of the
computation or the result divided by 4. This feature is used during partial
multiplication operations. The latched result remains valid until the next time the
arithmetic unit is used. The latched result can be written onto the result bus R.
A number of flags are set or cleared depending upon the result latched by the
arithmetic unit's output shifter. These flags include the sign of the result,
whether or not the result is zero, and whether or not the result is less than or
equal to 15 (used to support multi-pass shifting during addition pre-alignment).
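The combination of an operand multiplexer that can complement and/or left-shift X, an adder, and an output shifter that divides by 4 matches the structure of a radix-4 (modified Booth) multiplier. The following integer sketch illustrates that technique in general; it is not the element's microcode. The pre-scaling of X, chosen so that each divide-by-4 is exact, and all names are assumptions made for the example.

```python
def booth_digits(y, bits):
    """Radix-4 Booth digits of a two's complement multiplier, least significant first."""
    assert bits % 2 == 0 and -(1 << (bits - 1)) <= y < (1 << (bits - 1))
    digits = []
    prev = 0                                  # implicit bit to the right of bit 0
    for i in range(0, bits, 2):
        b0 = (y >> i) & 1
        b1 = (y >> (i + 1)) & 1
        digits.append(prev + b0 - 2 * b1)     # each digit is in {-2, -1, 0, 1, 2}
        prev = b1
    return digits


def booth_multiply(x, y, bits=8):
    """Multiply x by a bits-wide two's complement y, two multiplier bits per step."""
    digits = booth_digits(y, bits)
    x_scaled = x << (2 * len(digits))         # pre-scale so every divide-by-4 is exact
    partial = 0
    for d in digits:
        # Input multiplexer: select 0, +/-X or +/-2X (complement and/or left shift).
        addend = {0: 0, 1: x_scaled, 2: x_scaled << 1,
                  -1: -x_scaled, -2: -(x_scaled << 1)}[d]
        # Adder, followed by the output shifter latching the sum divided by 4.
        partial = (partial + addend) >> 2
    return partial
```

Each step consumes two multiplier bits, so an n-bit multiplier needs n/2 add-and-shift steps instead of n.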

Output Registers: The output register module is used to communicate the
results of computations back from the processing element toward the left
boundary of an array of processing elements. The output register can be
parallel loaded from the arithmetic unit or can be serially loaded from a serial
source (often the serial source is another processing element's output
register). The output register is unloaded serially. The output register is parallel
loadable by the arithmetic unit in three parts: sign, exponent and fraction. This
facilitates conversion from the internal data representation to IEEE 754 floating
point format.

During the time when the arithmetic unit is converting the accumulator contents
into IEEE floating point format, a flag is set to indicate that a register unload is
in progress (UIP).

Control and Sequencing: This module includes a microcode ROM, a program
counter, branch control logic, flags logic, an instruction register, an instruction
decoder, address decoders and timing and control logic. This circuitry is used to
sequence the processing element through its operations. Each clock cycle, the
microcode ROM issues a microinstruction to the processing element's datapath
units, and thereby controls the function and timing of the data operations being
performed. Data and control flags fed to the branch control logic enable the
processing element to perform data dependent operations required for
implementation of the floating point algorithms. Fields of the instruction
transmitted serially to the processing element as part of the {instruction, data}
2-tuple are also fed to the branch control logic and flags logic of the Control and
Sequencing Unit. These also determine the sequence of microinstructions
executed by the processing element. The instructions specified in the
{instruction, data} 2-tuple are distinct from the set of microinstructions
implemented by the processing element. The instructions specified in the
{instruction, data} 2-tuple control the flow of execution of the processing
element's microcode.

The internal data representation used by the processing element uses two 32-
bit data words to represent each IEEE single precision number. One of the two
words represents the mantissa in 2's complement form, normalized to bit 29.
The second word represents the exponent using an exponent bias of 2^29. This
format provides better resolution in the mantissa than IEEE single precision
format, and the use of a large exponent field virtually guarantees that exponent
overflow cannot occur.
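One possible reading of this two-word representation is sketched below. Several details are assumptions rather than statements from the text: the hidden leading one of the IEEE fraction is placed at bit 29, the exponent bias is taken to be 2^29, a signed Python integer stands in for the 2's complement mantissa word, and only normal numbers and zero are handled. The function names are hypothetical.

```python
import math
import struct

NORM_BIT = 29            # assumed position of the leading one of the mantissa
EXP_BIAS = 1 << 29       # assumed exponent bias
IEEE_BIAS = 127          # IEEE 754 single precision exponent bias


def to_internal(value):
    """Split an IEEE single (normal or zero) into (mantissa word, exponent word)."""
    (word,) = struct.unpack(">I", struct.pack(">f", value))
    sign = word >> 31
    exp = (word >> 23) & 0xFF
    frac = word & 0x7FFFFF
    if exp == 0 and frac == 0:
        return 0, 0                                     # zero (encoding chosen for this sketch)
    assert 0 < exp < 255, "denormals, infinities and NaNs are not modelled"
    magnitude = ((1 << 23) | frac) << (NORM_BIT - 23)   # hidden bit moved to bit 29
    mantissa = -magnitude if sign else magnitude
    return mantissa, exp - IEEE_BIAS + EXP_BIAS


def from_internal(mantissa, exponent):
    """Inverse of to_internal for values it can produce."""
    return math.ldexp(float(mantissa), exponent - EXP_BIAS - NORM_BIT)
```

Under these assumptions the 24 significant bits of an IEEE single occupy bits 29 down to 6 of the mantissa word, leaving six low-order bits of extra resolution for intermediate results.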

Within each processing element, multiplication is facilitated by the inclusion of a
modified Booth encoder and multiplexer. The denormalisation and
normalisation operations required by the floating point accumulation or addition
algorithms are facilitated by the repeated application of the shifter circuit which
can shift up to 15 bits in a single cycle.

FIG. 2 shows a scalable array processing chip composed of a 5 x 4 rectangular
array of single precision floating point processing elements which accept serial
dataflow operands, and which perform a set of operations on those operands.
Each operand consists of a 5-bit instruction followed by an IEEE standard
single precision number. Each processing element is a microcoded ALU with a
32-bit parallel datapath that includes dedicated hardware support for floating
point multiplication and addition algorithms.

The array of processing elements is clocked synchronously. Three bit-
serial links provide communication between processing elements. One link is
provided for each of the two input X and Y operands and one for the output, or
result operand R. As shown in FIG. 2, input data is transferred from left to right
across the array, and output results are transmitted from right to left. Chips can
be cascaded arbitrarily in both X and Y directions.

The operation of the scalable array processing chip is described with reference
to the system block diagram shown in FIG. 3. The data interface provides
communication between the scalable array chip and the host system. The data
formatter elements are described separately in a co-pending application
number PL5696 entitled DATA FORMATTER.

The I/O architecture of each processing element consists of two orthogonal
data transmission paths for X and Y operands, each consisting of a single one-
bit delay cell and a 32-bit data storage register. The X operand path also
includes a 5 bit instruction register. Data is input to the array as a sequence of
{instruction, data} 2-tuples. These are split into separate instruction and data
words on receipt by the input registers.

Each X data operand consists of a 5-bit instruction followed by a single 32-bit
IEEE 754 standard floating point number. A variable length gap of several
clock periods may be present between operands for I/O synchronisation. The
operand is transmitted in bit serial form into the processing element. When the
entire {instruction, data} 2-tuple is held within the processing element, it is
cross-loaded into parallel holding registers. The instruction is decoded and
used to control the execution of the floating point algorithms. The data is
converted by hardware into the internal extended format. The internal format
has both extended precision and extended dynamic range when compared with
the IEEE standard.
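The {instruction, data} 2-tuple can be pictured as a 37-bit serial word. The sketch below builds and splits such a word using Python's struct module. The bit order within each field (most significant bit first) is an assumption, since the text does not state it, and the function names are hypothetical.

```python
import struct


def pack_operand(instruction, value):
    """Return 37 bits: a 5-bit instruction followed by a 32-bit IEEE 754 single."""
    assert 0 <= instruction < 32
    (word,) = struct.unpack(">I", struct.pack(">f", value))
    ins_bits = [(instruction >> k) & 1 for k in range(4, -1, -1)]   # MSB first (assumed)
    data_bits = [(word >> k) & 1 for k in range(31, -1, -1)]        # MSB first (assumed)
    return ins_bits + data_bits


def unpack_operand(bits):
    """Split a 37-bit serial word back into (instruction, value)."""
    instruction = int("".join(map(str, bits[:5])), 2)
    word = int("".join(map(str, bits[5:])), 2)
    (value,) = struct.unpack(">f", struct.pack(">I", word))
    return instruction, value
```

A data formatter would emit one such word per operand, with any synchronisation gap between words left out of this model.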

The bit-serial data is bit-skewed on entry to adjacent processing elements on
the array boundary. This skew is preserved between adjacent elements within
the array by passing the data through the single-bit delay stage in each
processing element before re-transmitting it to the next processing element.
The use of serial data both minimises the I/O pin count at the array boundary
and allows adjacent processing elements to both commence and conclude their
computations with a time differential of only one bit period. The advantage of
the bit-skewing approach over a broadcast architecture is that there is no need
to drive long buses with large buffers, which provides the capability for
arbitrary expansion of the array.

Bit skewing has the advantage over word-skewing in that fewer wavefronts are
required to complete a processing task. The bit-skewed approach therefore
results in the minimisation of job time. The computation time is minimised for
both a single job and a job stream.

At the completion of a set of computations, an operand wavefront is issued to
the array which causes the unloading of the results into the output registers of
the processing elements.

Clocking of the scalable array processing chips is performed by a single phase
50% duty cycle clock from which all internal timing signals are generated. The
clock is buffered on entry to the chip and is distributed to each processing
element. It is re-buffered within the processing element where it is used as a
locally synchronous clock. In addition, each processing element generates a
second, synchronous clock of the same frequency but with a duty cycle
determined by a self-timed circuit. The secondary clock is used to provide
timing information for bus precharging, data transfers and evaluation of
execution units.

FIG. 4 shows schematically the inner-product-step process described by Kung
and Leiserson. Data is clocked into each processing cell from the left and top
edges while the results are clocked out from right to left. For a matrix product
algorithm, an inner product accumulate algorithm is used in preference to the
inner product step process common in much of the prior art. The inner-product-
accumulate process is depicted schematically in FIG. 5. Data is again clocked
into the element from the left and top but in this case the result is formed in
place. An explicit unload phase is implemented to obtain the result after the
computation is complete. An advantage of the inner product accumulate
algorithm over the inner product step approach is illustrated when matrix
products are computed for matrix operands which are rectangular. The inner
product step process requires the recirculation of the result partial product
matrix. In contrast the inner product accumulate algorithm computes the result
in-place, and incurs no hardware penalties, irrespective of the length of the
inner products.

The sequence of operations performed by the processing elements is
determined by the 5 bit instruction transmitted as part of the X operand. The
five instruction fields and their function are listed in the table below.



Instruction   Bit No.   Function

ADD           4         Floating point add
LDR           3         Convert result to IEEE format and load O/P register
HAD           2         Enable result unloading only if active flag set
SDE           1         Set active flag if accumulator contents are non-zero
CLR           0         Clear accumulator prior to computation

TABLE 1

The default operation performed by the PE (i.e., when none of the fields of the
instruction are asserted) is an inner-product operation implemented as a
floating point multiply-accumulate, the input X and Y operands being multiplied
and accumulated with the contents of the accumulator.
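Because each field occupies its own bit, an instruction word can be composed and inspected with simple bit operations. The helper names below are hypothetical, and the bit 1 field is given the name printed in Table 1 (SDE); the running text also prints it as SOE.

```python
# Bit positions as listed in Table 1.
FIELDS = {"ADD": 4, "LDR": 3, "HAD": 2, "SDE": 1, "CLR": 0}


def encode(*names):
    """Build a 5-bit instruction word with the named fields asserted."""
    word = 0
    for name in names:
        word |= 1 << FIELDS[name]
    return word


def decode(word):
    """Return a dict showing which fields of a 5-bit instruction are asserted."""
    return {name: bool((word >> bit) & 1) for name, bit in FIELDS.items()}
```

An all-zero word therefore selects the default multiply-accumulate, while encode("ADD", "HAD", "CLR") gives the field combination the text requires for matrix addition.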

5 If the CLR field is asserted, the accumulator contents is cleared before the 
computation comnnences. This generally occurs for the first wavefront of a 
matrix multiplication, and also when executing eierr^nt-udse operations. The 
accumulator is deared before the computation is commenced but after tiie 
ACTIVE flag is set if (tiie SOE field) is set, and after tiie accumulator has been 
1 0 unloaded into tiie result register (if the LDR field is set). 

If the SDE field is set AND the value held in the accumulator (from the previous
operation) is non-zero, an internal flag, ACTIVE (one per processing element)
is set to indicate that this processing element is an active element. Only active
elements are permitted to unload results during element-wise operations.

If the HAD field is asserted, the operation being performed is deemed to be an
element-wise (Hadamard) operation. If this field is set, only those processing
elements flagged as active elements (as determined by their ACTIVE flags) can
unload their accumulator contents into their output register R.

If the LDR field is asserted, the accumulator contents from the previous
computation are converted back to IEEE format and are unloaded into the
processing element's output register.

During the unloading process, the processing element issues a flag (UIP) to
indicate that the unload is in progress.

If the ADD field is asserted, the X and Y operands are added rather than
multiplied prior to the result being stored in the accumulator. The HAD and CLR
fields must also be asserted for matrix addition instructions.
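
The field semantics described above can be summarised in a short behavioural
model. The following Python sketch is illustrative only: the class name
ProcessingElement, the method step, and the use of Python floats in place of
the internal extended precision format are assumptions, as is the reading that
LDR triggers the unload while HAD merely restricts it to active elements.

```python
# Bit positions of the five instruction fields, as listed in Table 1.
ADD, LDR, HAD, SDE, CLR = 1 << 4, 1 << 3, 1 << 2, 1 << 1, 1 << 0


class ProcessingElement:
    """Behavioural model of one PE (hypothetical; not the actual microcode)."""

    def __init__(self):
        self.acc = 0.0       # accumulator
        self.active = False  # ACTIVE flag
        self.out = None      # output register R

    def step(self, instr, x, y):
        # 1. SDE: flag the PE as active if the previous result is non-zero.
        if (instr & SDE) and self.acc != 0.0:
            self.active = True
        # 2. LDR: unload the previous result; HAD restricts this to active PEs.
        if (instr & LDR) and (not (instr & HAD) or self.active):
            self.out = self.acc
        # 3. CLR: clear the accumulator before the new computation.
        if instr & CLR:
            self.acc = 0.0
        # 4. ADD selects x + y; the default is a multiply-accumulate.
        self.acc += (x + y) if (instr & ADD) else (x * y)


pe = ProcessingElement()
pe.step(CLR, 2.0, 3.0)        # first wavefront: acc = 2*3
pe.step(0, 1.0, 4.0)          # default multiply-accumulate: acc = 6 + 1*4
pe.step(LDR | CLR, 0.0, 0.0)  # unload the previous result, then clear
print(pe.out)                 # 10.0
```

The ordering inside step follows the sequence stated in the text: the ACTIVE
flag is updated first, the accumulator is unloaded next, and only then is it
cleared for the new computation.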

The element-wise operations of addition and multiplication defined by

$C = A + B$ where $c_{ij} = a_{ij} + b_{ij}$

$C = A \bullet B$ where $c_{ij} = a_{ij} b_{ij}$

are performed by first setting the active-element (ACTIVE) flag in a desired
processing element. This procedure is typically done once during system
initialization. It is achieved by issuing an instruction with the SDE (set active
element) field asserted. When this occurs, processing elements that contain
non-zero results in their accumulators set the value of their ACTIVE flag to
TRUE.



The processing elements accept operand data and return results in IEEE
standard format. Internally, an extended precision format is used for both the
mantissa and exponent of the partial results.

The internal formats used for the representation of mantissa and exponent are
as follows:

2's complement mantissa    sgf.fffffffffffffffffffffffffffff

Exponent                   0eeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee

Exponent Bias              001000000000000000000000000000000

s    is the sign bit of the mantissa. +/- is represented as 0/1
     respectively.

g    is a guard bit used to avoid mantissa overflow during
     accumulation.

f    is a fraction (mantissa) bit. The mantissa is normalized: the most
     significant fraction bit is 1 (explicit).

.    is the position of the binary point (showing that the mantissa is
     normalized).

e    is a bit of the exponent which is held in biased form. The
     exponent bias is the value shown above.
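
As a rough illustration of the first step of such a conversion, the sketch
below splits an IEEE 754 single precision value into its sign, biased exponent
and fraction fields and forms a signed mantissa with the leading 1 made
explicit. The function names are hypothetical, the field widths are those of
IEEE 754 single precision rather than the wider internal format, and treating
denormalized inputs as zero is an assumption not stated in the text.

```python
import struct


def split_ieee_single(value):
    """Return (sign, biased_exponent, fraction) of an IEEE 754 single."""
    (bits,) = struct.unpack(">I", struct.pack(">f", value))
    sign = bits >> 31
    biased_exponent = (bits >> 23) & 0xFF
    fraction = bits & 0x7FFFFF
    return sign, biased_exponent, fraction


def signed_mantissa(value):
    """Signed mantissa with the implicit leading 1 made explicit.

    Zero and denormalized inputs map to 0 in this sketch (an assumption).
    """
    sign, biased_exponent, fraction = split_ieee_single(value)
    if biased_exponent == 0:
        return 0
    magnitude = (1 << 23) | fraction
    return -magnitude if sign else magnitude


print(split_ieee_single(1.0))   # (0, 127, 0)
print(split_ieee_single(-2.5))  # (1, 128, 2097152)
print(signed_mantissa(-2.5))    # -10485760
```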

If the flags in the anti-active processing elements have been set by a prior SDE
instruction, and an element-wise multiplication of a matrix A with the unit matrix
is executed, the result of the operation is the transpose of the matrix A. If an
arbitrary orthogonal set of elements have their flags set, a permutation of the
input matrix will be performed by this element-wise product.

When an unload (LDR) instruction is received, the accumulator contents are
converted from the internal format to an IEEE standard form. Numbers outside
the range that can be represented by the IEEE single precision format are
truncated to zero (in the case of results with large negative exponents, including
IEEE denormalized numbers) or limited to infinity (in the case of numbers with
large positive exponents). In both cases, the sign of the zeros or infinities is
retained (unless the result is a true zero, in which case positive zero is always
returned).
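
The range handling described above can be sketched as follows. This is a
minimal illustration under stated assumptions: the function name
unload_to_single is hypothetical, a Python float stands in for the internal
extended precision value, and the rounding applied to in-range values is
whatever the struct module performs, since the text does not state the
rounding rule used by the hardware.

```python
import math
import struct

MIN_NORMAL = 2.0 ** -126                      # smallest normalized single
MAX_SINGLE = (2.0 - 2.0 ** -23) * 2.0 ** 127  # largest finite single


def unload_to_single(value):
    """Map an extended-range value onto IEEE single precision."""
    if value == 0.0:
        return 0.0                             # true zero: always positive
    if abs(value) < MIN_NORMAL:
        return math.copysign(0.0, value)       # flush to signed zero
    if abs(value) > MAX_SINGLE:
        return math.copysign(math.inf, value)  # limit to signed infinity
    (rounded,) = struct.unpack(">f", struct.pack(">f", value))
    return rounded


print(unload_to_single(1e39))    # inf
print(unload_to_single(-1e-40))  # -0.0
print(unload_to_single(1.5))     # 1.5
```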

The IEEE representation of the result is loaded into a separate output register
which is concatenated with other output registers in adjacent processing
elements to form an output register chain. The result is output in a serial form
through this register chain.

Matrix algorithms which are elements of the set of primitive operators
{multiplication, addition, element-wise (or Hadamard) multiplication,
permutation} are performed directly by the processing array. Implementation of
these operations for operands whose dimension exceeds the size of the array
is possible by mathematically partitioning the operations to a set of operations
which can be computed separately using the available array.
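
One way to picture this partitioning is a blocked matrix product in which
every sub-product fits a fixed array size. The sketch below is an assumption
about how such a partition might look, not the scheme used by the apparatus:
the function names and the tile parameter are hypothetical, and each call to
matmul stands in for one pass through the array.

```python
def matmul(a, b):
    """Plain matrix product of two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def blocked_matmul(a, b, tile):
    """Compute a*b using only products of blocks no larger than tile x tile."""
    n, m, p = len(a), len(b), len(b[0])
    c = [[0] * p for _ in range(n)]
    for i0 in range(0, n, tile):
        for j0 in range(0, p, tile):
            for k0 in range(0, m, tile):
                a_blk = [row[k0:k0 + tile] for row in a[i0:i0 + tile]]
                b_blk = [row[j0:j0 + tile] for row in b[k0:k0 + tile]]
                part = matmul(a_blk, b_blk)  # one pass through the array
                # Accumulate the partial product into the result block.
                for i, row in enumerate(part):
                    for j, v in enumerate(row):
                        c[i0 + i][j0 + j] += v
    return c


a = [[1, 2, 3, 4, 5], [6, 7, 8, 9, 10], [11, 12, 13, 14, 15]]
b = [[1, 0, 2, 0], [0, 1, 0, 2], [3, 0, 1, 0], [0, 3, 0, 1], [1, 1, 1, 1]]
print(blocked_matmul(a, b, 2) == matmul(a, b))  # True
```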

For the particular case where the problem size does not exceed the size of the
array, recursive algorithms can be implemented which recirculate the output of
the array back to its input. This can be a useful method to minimise memory
bandwidth requirements in particular applications.

If a matrix multiplication is commenced with an instruction which does not clear 
the accumulator, the result of the multiplication will be summed with the prior
result. This gives a matrix multiplication/accumulation capability which has
direct application to the evaluation of complex matrix operations. 
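
As an illustration of why multiplication/accumulation helps with complex
matrices, the sketch below forms a complex product from four real
multiply-accumulate passes. It is a sketch under stated assumptions: the
helper names mac, negate and complex_matmul are hypothetical, and the
subtraction in the real part is obtained here by negating one operand, which
is one possible choice rather than the method described in the source.

```python
def mac(acc, a, b, clear=False):
    """Return (0 if clear else acc) + a*b for real matrices (lists of rows)."""
    n, m, p = len(a), len(b), len(b[0])
    return [[(0 if clear else acc[i][j]) +
             sum(a[i][k] * b[k][j] for k in range(m))
             for j in range(p)] for i in range(n)]


def negate(a):
    """Element-wise negation of a matrix."""
    return [[-v for v in row] for row in a]


def complex_matmul(ar, ai, br, bi):
    """(ar + j*ai)(br + j*bi) using four real multiply-accumulate passes."""
    re = mac(None, ar, br, clear=True)  # accumulator cleared first
    re = mac(re, negate(ai), bi)        # accumulate without clearing
    im = mac(None, ar, bi, clear=True)
    im = mac(im, ai, br)
    return re, im


ar, ai = [[1, 2], [3, 4]], [[0, 1], [1, 0]]
br, bi = [[1, 0], [0, 1]], [[1, 1], [0, 0]]
print(complex_matmul(ar, ai, br, bi))
# ([[1, 2], [2, 3]], [[1, 2], [4, 3]])
```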

FIG. 6 shows the way in which conformal matrix operands are entered into the
systolic array. Bit-skewing is indicated by the small offset between adjacent
rows of A and columns of B. Each element of the processor array computes an
element $c_{ij}$ of the result matrix C, by evaluating the inner product
$c_{ij} = \sum_{k} a_{ik} b_{kj}$. After the last wavefront has been input to the
array, the result matrix may be read from the array. The elements are obtained
in the order shown in FIG. 7.
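
The timing of such an array can be sketched with a small simulation in which
each processing element keeps its own result and operands arrive skewed. This
is an illustrative model only: it skews operands by whole words rather than by
bits as in FIG. 6, and the function name systolic_matmul is hypothetical.

```python
def systolic_matmul(a, b):
    """Product a*b computed as a result-stationary systolic array would.

    Row i of A enters from the left delayed by i steps and column j of B
    enters from the top delayed by j steps, so PE (i, j) sees the operand
    pair with inner index k = t - i - j at time step t.
    """
    n, m, p = len(a), len(b), len(b[0])
    c = [[0] * p for _ in range(n)]
    for t in range(n + m + p - 2):         # last useful step is n+m+p-3
        for i in range(n):
            for j in range(p):
                k = t - i - j
                if 0 <= k < m:             # an operand pair is present
                    c[i][j] += a[i][k] * b[k][j]
    return c


a = [[1, 2, 3], [4, 5, 6]]
b = [[7, 8], [9, 10], [11, 12]]
print(systolic_matmul(a, b))  # [[58, 64], [139, 154]]
```

Because every element accumulates its own inner product in place, rectangular
operands only change the number of wavefronts, not the structure of the array.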



If the processing elements on the main diagonal in the array have their active
element flag set by an arbitrary prior operation as shown in FIG. 8, the
processor array can be used for the element-wise operations of addition and
Hadamard multiplication. FIG. 8 shows the entry of conformal matrices to a 4 x
4 subarray of the chip for the purposes of element-wise addition or
multiplication. Only those elements shown as • are used.

FIG. 9 shows the relationship between rows of data which are output from the
array after an element-wise operation. Due to the word-length registers present
in the output register chain, the data is skewed by one word-time plus one
bit-time. The additional bit-time delay is caused by the bit skewing of the input
operands.

In a second embodiment the invention has been implemented in a system
hosted by a Sun SPARCstation. The matrix processor is interfaced to the Sun
SPARCstation via the SBus. This arrangement is convenient since it allows the
SCAP hardware to operate using virtual addressing, with virtual to physical
translation being performed by the SBus controller in the SPARCstation. The
host processor and the matrix processor therefore share the same data space,
so both can interact with the matrix data directly. This approach does however
have its own disadvantages, the most critical being the fact that the data
transfer rate across the SBus tends to be quite low due to the overheads of
address translation.

To compensate for this low data rate, the matrix processor also includes a
cache memory subsystem. The cache supports burst mode data transfers
across the SBus on cache misses and can also be used to hold frequently used
operand matrices (such as coefficient matrices in transform applications) and to
store temporary or intermediate results.

A novel cache partitioning scheme has been implemented. The technique
allows the cache to be dynamically divided into a number of regions that are
guaranteed not to interact, thereby ensuring that fetches for one matrix operand
do not interfere with fetches for the other. The data controllers determine how
the cache is partitioned on a per-operand/result basis (it is also possible to
assign a cache partition to the instruction streams) by issuing an 8-bit space
address along with each address generated. Each bit of the space address can
be set or cleared, or can take on the value of one of the generated address bits.
In our system implementation, three bits of this space address are used to
control non-cached accesses, temporary matrix accesses and temporary matrix
initialization. Four bits are used to partition the cache into up to 16 independent
regions. Use of the temporary matrix control bits of the space address allows
temporary result matrices to be stored entirely within the cache without being
written out to the host; in fact, such matrices are entirely invisible to the host
processor. The maximum data throughput obtainable using the cache is 12.5
Mwords/second.
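
The rule that each space address bit is either fixed or copied from a
generated address bit can be sketched as below. The encoding of the per-bit
specification, the function name space_address and the bit assignments in the
example are all hypothetical; the text does not say which bit positions carry
the control and partition fields.

```python
def space_address(bit_specs, address):
    """Build an 8-bit space address for one generated address.

    bit_specs[n] describes bit n of the space address: 0 (cleared),
    1 (set), or ("addr", m) meaning "copy bit m of the generated address".
    """
    if len(bit_specs) != 8:
        raise ValueError("expected 8 bit specifications")
    value = 0
    for n, spec in enumerate(bit_specs):
        if isinstance(spec, tuple):
            bit = (address >> spec[1]) & 1  # follow a generated address bit
        else:
            bit = spec & 1                  # fixed 0 or 1
        value |= bit << n
    return value


# Hypothetical configuration: bit 0 fixed to 1, bits 2 and 3 follow
# generated address bits 4 and 5, all other bits cleared.
specs = [1, 0, ("addr", 4), ("addr", 5), 0, 0, 0, 0]
print(space_address(specs, 0x30))  # 13
print(space_address(specs, 0x10))  # 5
```

Letting partition bits follow address bits means a single operand stream can
spread over several regions, while fixed bits pin a stream to one region.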

The two custom chips implemented during the development of this system are
a processing element array chip and a data controller chip. Both chips were
designed using a generic 1.2 micron double layer metal CMOS process rule-set
and were retargeted for fabrication using a 1.0 micron process using a gate
shrink.

The processing element array chips are full custom integrated circuits each
containing an array of 4 rows by 5 columns of floating point processing
elements. Because the overall computation rate is limited by the available data
bandwidth, the speed of computation of the processing elements is not overly
important. Therefore, the architecture has been designed to yield processing
elements (PEs) that are physically small rather than being particularly fast.
Each complete floating point unit occupies only 2.7 sq mm.

The processing element does not include a dedicated hardware multiplier, but
is implemented as a simple microprogrammed 32-bit datapath with hardware
support to aid the floating point computations, as illustrated in FIG. 5.

The PE hardware incorporates a Booth encoder and multiplexer to facilitate
multiplication using an iterative modified Booth algorithm, and also a
shifter/normalizer that can be used for pre-addition alignment as well as post
addition normalization. When used as a normalizer, the shifter has the ability to
compute the amount by which the exponent must be adjusted during the same
time that the normalization occurs. Computation of the floating point arithmetic
operations (multiply/accumulate, multiply or add/subtract) is completed within
40 clock cycles.
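
An iterative modified Booth multiplication can be sketched in software as
follows. The text does not give the radix or the operand widths, so this
sketch assumes the common radix-4 form, in which the multiplier is recoded two
bits at a time into digits from {-2, -1, 0, 1, 2}; the function name and the
bits parameter are hypothetical.

```python
def booth_radix4_multiply(x, y, bits=24):
    """Multiply x by y, where y is a signed two's complement value of
    `bits` bits, using radix-4 (modified) Booth recoding of y."""
    if bits % 2:
        raise ValueError("bits must be even for radix-4 recoding")
    if not -(1 << (bits - 1)) <= y < (1 << (bits - 1)):
        raise ValueError("y does not fit in the given width")
    y_u = y & ((1 << bits) - 1)   # two's complement bit pattern of y
    product = 0
    prev = 0                      # implicit bit to the right of bit 0
    for i in range(0, bits, 2):   # one iteration per pair of bits
        b0 = (y_u >> i) & 1
        b1 = (y_u >> (i + 1)) & 1
        digit = b0 + prev - 2 * b1        # Booth digit in {-2,...,2}
        product += (digit * x) << i       # add the shifted partial product
        prev = b1
    return product


print(booth_radix4_multiply(13, -7))  # -91
```

Each iteration selects one of 0, ±x or ±2x, which in hardware needs only a
multiplexer, a shift and an adder, and halves the number of add steps compared
with a bit-at-a-time multiplier.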

The processing element array chip accepts IEEE single precision floating point
numbers as inputs and feeds results back through the data controllers in the
same format. Internally, a proprietary number representation is used, including
a 31 bit exponent that virtually eliminates the possibility of exponent overflow.

The chips operate at 20MHz clock speed, achieving around 20 MFLOPS peak
performance per chip. Processing arrays of arbitrary size can be built with no
external components simply by stacking the chips to form a two dimensional
array. The pin-out of the chip is such that 1-to-1 connection of inputs and
outputs of adjacent chips can be made. All communication to and from the
array is via the edge elements of the array. Operand data enters the array on
left and top edges. This data is known as the X and Y operand data
respectively. The result data (R) emerges from the left edge of the array and
can be extracted independently from the application of operand wavefronts
(that is, the operand and result streams operate in parallel).

The only global signals in the array are clock and reset. Because all
communication is local (nearest neighbour only), the system is insensitive to
clock skew from one side of the array to the other. The only requirement is that
the skew between adjacent PEs is kept under control. This can be readily
achieved by orderly layout of clock routing and/or insertion of clock buffering.

The processing elements are low power devices due to their architecture. The
entire chip containing 20 processing elements dissipates less than half a watt.
This corresponds to less than 5mA per processing element at 20MHz
operation, or 5mA per MFLOP.



Number of Transistors         270000
Die Size (Pad to Pad)         8.56mm x 8.35mm
Transistor Density            3800 T/sq mm
Power Dissipation             0.5 Watts
Package                       68 CLCC
Floating-point Performance    20 MFLOPS @ 20 MHz
Design Style                  Full Custom



TABLE 2 
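
The headline figures in the table can be cross-checked against the numbers
quoted in the description. The short calculation below is only a consistency
check: counting a multiply-accumulate as two floating point operations is an
assumption, and the 40 cycle figure is the stated upper bound per operation.

```python
CLOCK_HZ = 20e6        # 20 MHz clock
CYCLES_PER_OP = 40     # stated bound for one floating point operation
PES_PER_CHIP = 4 * 5   # 4 rows by 5 columns of processing elements
FLOPS_PER_MAC = 2      # assumption: multiply-accumulate counts as 2 FLOPs

peak_mflops = CLOCK_HZ / CYCLES_PER_OP * FLOPS_PER_MAC * PES_PER_CHIP / 1e6
density = 270000 / (8.56 * 8.35)           # transistors per sq mm of die
power_per_pe_mw = 0.5 * 1000 / PES_PER_CHIP

print(peak_mflops)       # 20.0
print(round(density))    # 3777
print(power_per_pe_mw)   # 25.0
```

The computed density of about 3777 transistors per sq mm agrees with the
rounded 3800 in the table, and 25 mW per processing element would match the
quoted 5mA if a 5 V supply is assumed (the supply voltage is not stated).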



The performance attained by the apparatus of the second embodiment for a
range of applications is shown in Table 3.

Application                          Exec. Time          Performance

3450 point 1D Fourier transform      20 msec             130 MFLOPS
using 2D factorization

2D Fourier Transform of 380 x 380    1385 msec           66 MFLOPS
point image

4000 tap FIR Filter                  35 msec per 1000    210 MFLOPS
                                     data samples

10th order Matrix polynomial         136 msec            114 MFLOPS
evaluation of 60 x 60 complex
matrix

QR factorization of 59 x 60 Matrix   561 msec            87 MFLOPS

TABLE 3

1. A processing element suitable for use in a scalable array processor
comprising of at least one input register means adapted to receive and process
serial operands in the form of {instruction, data} 2-tuples, a memory means
adapted to store temporary results and constants, a computing means adapted
to perform logical operations, an output register means adapted to output
results from the processing element, a control and sequencing means adapted
to control the operation of the processing element;
a plurality of data buses adapted to provide communication between the
plurality of means.

2. An apparatus as in claim 1 wherein the computing logical means
consists of a shifter/normaliser means adapted to shift/normalise data and an
arithmetic means adapted to perform logical operations such as but not limited
to addition, subtraction and partial multiplication operations.

3. An apparatus as in claim 1 wherein the processing element is adapted
to perform floating point multiply, floating point add and floating point
multiply-accumulate which can be used for inner product accumulate operations.

4. An apparatus as in claim 1 wherein the input register is adapted to output
a copy of an input operand bit with a one clock period delay.

5. An apparatus as in claim 1 wherein there are N input registers and the
processing element is suitable for use in a N-dimensional scalable array
processor.

6. An apparatus as in claim 5 wherein N can be any positive integer.

7. An apparatus as in claim 1 wherein the input registers are adapted to
convert the input serial data to an internal representation comprising separate
sign, fraction and exponent.

8. An apparatus as in claim 1 wherein the memory means consists of read
only memory for storage of constants and a read/write memory for storage of
temporary results.

9. An apparatus as in claim 2 wherein the shifter/normalizer means is
adapted to perform binary weighted barrel shifting wherein the shifter function is
determined by a control input to the shifter/normalizer and the normalizer
function effects a data dependent shift of up to 15 bits within a single clock
cycle.

10. An apparatus as in claim 2 wherein the arithmetic means implements
logical operations such as but not limited to floating point addition, multiplication
and multiply-accumulate algorithms using a parallel microcoded data path.

11. An apparatus as in claim 2 wherein the arithmetic means comprises a
logical unit such as but not limited to an input-multiplexer, an adder, an output
shifter, a flags unit and a control unit.

12. An apparatus as in claim 1 wherein the output register is adapted to be
loaded in parts to enable the conversion from the internal representation to
IEEE 754 floating point format and the output register is adapted to be parallel
loaded from the arithmetic means or serially loaded from a serial source and
the register is unloaded serially.

13. An apparatus as in claim 1 wherein the control and sequencing means 
includes timing and control logic, a microcode ROM, address decoders, branch
control logic, flags logic, instruction register, instruction decoder and a program
counter. 

14. An apparatus as in claim 1 wherein there are three data buses, an X bus,
a Y bus and a R bus and where the X and Y buses are called operand buses 
and the R bus is called the result bus. 

15. An apparatus as in claim 1 wherein there is an accumulator comparison
means. 

16. A scalable array processor chip comprising an array of processing
elements each said element including at least one input register means
adapted to receive and process serial operands in the form of {instruction, data}
2-tuples, a memory means adapted to store temporary results and constants, a
shifter/normalizer means adapted to shift or normalize data, an arithmetic
means adapted to perform logical operations such as but not limited to addition,
subtraction and partial multiplication operations, an output register means
adapted to output results from the processing element, a control and
sequencing means adapted to control the operation of the processing element,
a plurality of data buses adapted to provide communication between the
plurality of means and wherein each processing element has means for
communication only with adjacent elements.

17. An apparatus as in claim 16 wherein the array of elements comprise an
interconnected lattice of at least one dimension.

18. An apparatus as in claim 16 wherein the array of elements comprise an
interconnected lattice of at least two dimensions. 

19. An apparatus as in claim 16 wherein the scalable array processor chip is
adapted to perform at least the functions of computing the product of two or
more matrices, computing the element-wise product of two or more matrices,
computing the sum of two or more matrices, permuting the rows and columns of
a matrix and transposing a matrix.

20. A computing apparatus comprising a host processor, at least one
scalable array processor chip and a plurality of data formatters wherein the
scalable array processor chip(s) and plurality of data formatters are adapted to
perform matrix operations otherwise performed by the host processor.

21. An apparatus as in claim 20 wherein the apparatus includes a memory
cache adapted to store operand data and temporary or intermediate results.

22. A method of performing matrix operations comprising the steps of : 

(a) providing a plurality of processing elements in the form of an array
adapted to perform systolic processing operations;

(b) receiving operand matrix data for processing from a host or data source; 
(c) formatting the operand matrix data in a data formatter by adding an
instruction to form an {instruction, data} 2-tuple;

(d) transferring sets of 2-tuples to the processing element array to cause the
processing elements to process the data in accordance with the instruction;

(e) repeating the steps (b) to (d) a number of times said number of times 
being dictated by the matrix operation being performed;

(f) unloading the results of the matrix operations into an output result
register of the processing elements (under control of an instruction specified by
the operand 2-tuple);

(g) transferring the contents of the output result registers held within the
plurality of processing elements back to data formatters as result wavefronts;

(h) storing the result wavefront data back to a host or data sink; and 

(i) repeating the steps (f) to (h) a number of times said number of times
being dictated by the matrix operation being performed.